Base qualities help sequencing software.
Abstract
With the complete sequencing of the human genome under way and the sequencing of complete microorganism genomes becoming commonplace, we have truly entered the era of large-scale DNA sequencing. Why now? As in some other data-rich areas of modern biology, for example, protein structure determination, it can be argued that the rate-limiting factors in increasing efficiency and throughput have been computer power and software. We could have run thousands of sequencing gels 20 years ago, but without image-processing software and fragment assembly packages it would not have been feasible to put together all of the individual sequence fragments from the gels to give megabases of continuous, accurate sequence. At any rate, the development of powerful computational tools is central to large-scale sequencing.
This special informatics issue contains several papers on the software used in genome sequencing centers, and in particular three papers on the set of programs from Phil Green’s group at the University of Washington in Seattle (Ewing and Green 1998; Ewing et al. 1998; Gordon et al. 1998). These programs have played a key role in the progress of the largest-scale projects under way. They have been used extensively in the 100-Mb Caenorhabditis elegans project being completed this year and predominate among groups sequencing the human genome.
Such sequencing groups start with large clones such as BACs or PACs of 100 kb or more, or small genomes of up to a few megabases, for which the goal is to obtain complete accurate sequence. However, the raw sequences, or “reads,” obtained from the gels run on automated machines such as ABI 377s are only on the order of 500–1000 bp long and contain errors, particularly at the start and end of the read. To build up the longer sequence, many large-scale projects use a shotgun strategy, in which the first step is to collect thousands of primary reads from random subclones.
These are pieced together by assembly software based on overlaps detected by sequence comparison. Following assembly, the sequence is made contiguous and accurate by adding extra “finishing” reads selected from the subclones to fill gaps and cover ambiguous regions where the primary data did not give sufficiently reliable information.
The goals of computer software in this process are to (1) make the most of the available data, so as to minimize costly data collection, and (2) reduce and simplify human interaction by a combination of clever algorithms and good ergonomics. Currently no system works in a completely automated fashion; there are some pattern recognition and analysis tasks that humans still perform much better than our software does. We support the view expressed by Churchill and Waterman (1992), that it will continue to be important to involve human input, targeted at progressively more specific cases, and via progressively better interfaces. This will both improve overall accuracy, and, importantly, provide the source of new ideas for increasing automation.
Simplistically, sequencing software is involved in three stages: (1) obtaining the primary read data from the gel images; (2) assembling the reads into the correct map to derive a consensus; and (3) supporting the finishing process. The first two are essentially automatic, but for now the last is interactive, involving human input to make those remaining decisions that cannot yet be left reliably to computers. A number of different software packages have been developed to handle these tasks over the years, in both academic and commercial settings. Until recently, these dealt exclusively with base sequences determined from the reads. Where bases disagreed because of errors, either sufficient reads had to be present for a clear consensus to be obtained (which might still be wrong) or a user had to examine the original trace data manually.
To minimize editing, the reads were conservatively clipped to avoid the lower accuracy regions at the ends. Programs such as GAP (Dear and Staden 1991; Bonfield et al. 1995), followed by many others, made this manual editing process much easier by presenting aligned trace data graphically, but editing continued to be a significant bottleneck.
The major innovation of the software from Phil Green’s group has been to always keep an error probability measure, known as a “quality,” attached to each base prediction, either in a read or in the consensus. The initial quality values are obtained by the program phred (Ewing and Green 1998; Ewing et al. 1998), which makes base and quality calls for each read from the raw trace data. The assembly program phrap (P. Green, pers. comm.) uses the qualities both to improve assembly significantly and to give a more accurate consensus sequence. Finally, the interactive program consed (Gordon et al. 1998) works in tight conjunction with phrap to provide a finishing environment, with an emphasis on editing the quality values and reassembly using these together with new finishing reads, so as to minimize editing the base calls themselves in the traditional fashion.
Using estimates of confidence per base is not a new idea (see, for example, Lawrence and Solovyev 1994 and Bonfield and Staden 1995), but the phred/phrap/consed package is perhaps the first to use it in such a central and ubiquitous fashion. One of the most important gains coming from systematic use of qualities is that clipping is no longer needed before sequence assembly: the entire read length can be used. This has made an enormous difference for assembling human genomic sequence, ∼35% of ...
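As a rough illustration of why a per-base quality helps, the sketch below converts between a phred-scaled quality and an error probability using the standard definition Q = -10 log10(P), and then picks a consensus base at one aligned position by summing qualities per base. The consensus rule is a deliberately simplified stand-in, not phrap's actual algorithm, and the function names (phred_to_error_prob, error_prob_to_phred, weighted_consensus) are hypothetical names chosen for this example.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a phred-scaled quality Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a phred-scaled quality."""
    return -10 * math.log10(p)


def weighted_consensus(column):
    """Pick a consensus base for one aligned position.

    `column` is a list of (base, quality) pairs, one per read covering
    the position. Qualities are summed per base and the base with the
    largest total wins. This is an illustrative simplification, not the
    method used by phrap.
    """
    totals = defaultdict(int)
    for base, quality in column:
        totals[base] += quality
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # One high-quality A against two low-quality G calls.
    column = [("A", 40), ("G", 8), ("G", 9)]
    print(weighted_consensus(column))
    print(phred_to_error_prob(20))
```

With these inputs a simple majority vote over base calls alone would choose G, whereas the quality-weighted rule chooses A, because a single call at quality 40 outweighs two calls at qualities 8 and 9. The low-quality ends of a read can therefore be kept in the assembly without dominating the consensus.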
Similar resources
Transcriptome Sequencing of Guilan Native Cow in Comparison with bosTau4 Reference Genome
RNA-sequencing is a new method of transcriptome characterization of organisms. Based on identity and relatedness, there are large genetic variations among different cattle breeds. The goal of the current study was to sequence the transcriptome of Guilan native cow and compare with available reference genome using RNA-sequencing method. Blood samples were collected from 14 Guilan native cows and...
SNP calling using genotype model selection on high-throughput sequencing data
MOTIVATION A review of the available single nucleotide polymorphism (SNP) calling procedures for Illumina high-throughput sequencing (HTS) platform data reveals that most rely mainly on base-calling and mapping qualities as sources of error when calling SNPs. Thus, errors not involved in base-calling or alignment, such as those in genomic sample preparation, are not accounted for. RESULTS A n...
Accurate estimation of short read mapping quality for next-generation genome sequencing
MOTIVATION Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this quality score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy and the qualities ...
Mapping short DNA sequencing reads and calling variants using mapping quality scores.
New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this al...
User perceptions of effects of training: In search for qualities in use
Within HCI and Interaction design we have tended to overlook the potential in people adapting to technology, rather than the other way around. What learning contributes to usability, remains largely a white space on the map. Such knowledge could enable the development of more usable systems, and help focusing software developers and the after-market of software on the same set of qualities; qua...
Journal title: Genome Research
Volume 8, Issue 3
Publication date: 1998